Parallel Programming/Lesson 4
- Fundamental GPU algorithms
- Measure complexity
- Step complexity ("stages")
- Work complexity ("actual nodes in the tree")
- Efficient parallel algorithm can help reduce step complexity
- Work Efficient: if work complexity of the parallel algorithm is asymptotically the same as the serial algorithm
- Reduce
- Parallel implementation can have a smaller number of steps than a serial implementation
- Operator characteristics
- Serial: each loop iteration depends on the result of the previous one
- Parallel: break into multiple trees – regroup to allow multiple operations to run in parallel at the same time
- regrouping is legal because the operator is associative
- Linear work, O(n)
- But O(log n) steps
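A minimal sketch (not from the notes) of the tree-shaped reduce, counting steps and work to show the O(log n) steps / O(n) work claim:

```python
# Tree reduction: each "step" combines adjacent pairs in parallel.
# For n = 8 inputs we expect log2(8) = 3 steps and n - 1 = 7 additions.
def tree_reduce(xs, op):
    vals = list(xs)
    steps = work = 0
    while len(vals) > 1:
        nxt = []
        for i in range(0, len(vals) - 1, 2):
            nxt.append(op(vals[i], vals[i + 1]))
            work += 1
        if len(vals) % 2:        # odd element carries over unchanged
            nxt.append(vals[-1])
        vals = nxt
        steps += 1
    return vals[0], steps, work

total, steps, work = tree_reduce(range(8), lambda a, b: a + b)
print(total, steps, work)  # 28 3 7
```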
- Brent's Theorem
- How much time does it take if we only have p processors?
- If the maximum number of processors the algorithm can use at once is P: when P ≤ p, nothing changes
- but if P > p, every step that needs more than p processors slows down by a factor of about P/p
- Brent's theorem: time ≤ t + (m − t) / p to run something that has t steps, m total work items, and p processors
- (x + 1 + 2(x/2 + x/4 + … + 1)) / (x + 1) = 1 + 2(x/2 + x/4 + … + 1) / (x + 1)
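A worked example of the bound, assuming a tree reduction of n = 1024 elements (t = log₂ n = 10 steps, m = n − 1 = 1023 additions); the numbers here are illustrative, not from the notes:

```python
import math

# Brent's theorem bound: time <= t + (m - t) / p
# for t steps, m total work items, p processors.
def brent_bound(t, m, p):
    return t + (m - t) / p

n = 1024
t = int(math.log2(n))   # 10 steps for the tree reduction
m = n - 1               # 1023 additions in total
print(brent_bound(t, m, 1))    # 1023.0 -> serial: all the work, one processor
print(brent_bound(t, m, 32))   # ~41.7 with 32 processors
print(brent_bound(t, m, n))    # ~11.0 -> close to the 10-step ideal
```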
- Potential improvements for reduction
- Multiple items per thread reduction
- Do the first step of the reduction while reading from global memory, before writing to shared memory
- Warps execute in lockstep, so the final steps within a single warp need no explicit synchronization
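A sketch of the "multiple items per thread" idea (parameters assumed, not from the notes): each of p threads first sums a chunk serially, then only the p partial sums go through the tree, cutting the number of tree levels from log₂(n) to log₂(p):

```python
import math

# Each "thread" serially reduces a chunk, then the partials are tree-reduced.
def chunked_reduce(xs, p):
    chunk = math.ceil(len(xs) / p)
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    while len(partials) > 1:   # tree-reduce: log2(p) parallel steps
        partials = [sum(partials[i:i + 2])
                    for i in range(0, len(partials), 2)]
    return partials[0]

print(chunked_reduce(list(range(100)), 8))  # 4950
```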
- Scan
- Compaction, etc.
- Very useful applications in parallel
- Input array, binary associative operator, identity element
- [a0, a1, a2, …, a(n-1)] -> [I, a0, a0 + a1, a0 + a1 + a2, …]
- this exclusive formulation never uses the last input element (still an n-element -> n-element scan)
- Inclusive scan: f(i) = reduction of all elements up to and including xi
- Exclusive scan: f(i) = reduction of all elements up to but not including xi
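The two variants side by side, as a serial sketch (operator and identity passed in, matching the definition above):

```python
# Serial scans: n operations, n steps.
def inclusive_scan(xs, op, identity):
    out, acc = [], identity
    for x in xs:
        acc = op(acc, x)
        out.append(acc)        # includes x_i
    return out

def exclusive_scan(xs, op, identity):
    out, acc = [], identity
    for x in xs:
        out.append(acc)        # everything strictly before x_i
        acc = op(acc, x)
    return out

print(inclusive_scan([1, 2, 3, 4], lambda a, b: a + b, 0))  # [1, 3, 6, 10]
print(exclusive_scan([1, 2, 3, 4], lambda a, b: a + b, 0))  # [0, 1, 3, 6]
```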
- Parallelization
- The pattern in scan is very common – being able to parallelize it pays off significantly
- serial: n operations, n steps
- one approach: run a separate reduce for every single output (n² work, log n steps)
- Hillis & Steele (inclusive scan)
- step efficient
- step 1: each element adds the neighbor 1 to its left; elements without one copy through
- step 2: each element adds the neighbor 2 to its left; copy over the remaining
- repeat, doubling the hop each step (4, 8, …)
- log n steps, n log n work
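The doubling-hop steps above can be sketched as (a sequential simulation of the parallel steps, not GPU code):

```python
# Hillis & Steele inclusive scan: at step d, element i adds in the element
# 2**d to its left, everything else copies through.
# log2(n) steps, O(n log n) work.
def hillis_steele_scan(xs, op):
    vals = list(xs)
    hop = 1
    while hop < len(vals):
        # build the next step from the old values, as if all in parallel
        vals = [op(vals[i - hop], v) if i >= hop else v
                for i, v in enumerate(vals)]
        hop *= 2
    return vals

print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# [1, 3, 6, 10, 15, 21, 28, 36]
```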
- Blelloch (exclusive scan) – would be good to implement this
- reduce & downsweep
- downsweep starts by overwriting the final (reduced) value with the identity element
- add neighbors 1, 2, 4, etc. hops away
- work efficient
- 2 log n steps, O(n) work (about 2n operations) – same work efficiency as the serial implementation
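A sketch of the two phases, assuming a power-of-two input length (again a sequential simulation of the parallel steps):

```python
# Blelloch exclusive scan.
# Phase 1 (reduce / upsweep): partial sums at hops 1, 2, 4, ...
# Phase 2 (downsweep): write the identity over the root, then pass prefixes
# down, folding each left subtree back in.
# About 2 * log2(n) steps and O(n) work, matching the serial work.
def blelloch_scan(xs, op, identity):
    vals = list(xs)
    n = len(vals)
    hop = 1
    while hop < n:                      # upsweep (reduce)
        for i in range(2 * hop - 1, n, 2 * hop):
            vals[i] = op(vals[i - hop], vals[i])
        hop *= 2
    vals[n - 1] = identity              # downsweep starts with the identity
    hop = n // 2
    while hop >= 1:
        for i in range(2 * hop - 1, n, 2 * hop):
            left = vals[i - hop]
            vals[i - hop] = vals[i]     # pass the prefix down to the left
            vals[i] = op(left, vals[i]) # fold the left subtree back in
        hop //= 2
    return vals

print(blelloch_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b, 0))
# [0, 1, 3, 6, 10, 15, 21, 28]
```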
- Choosing between the 2
- More work than processors: prefer the more work-efficient algorithm (Blelloch)
- More processors than work: prefer the more step-efficient algorithm (Hillis & Steele)
- Histogram
- exclusive sum scan
- parallelization requires atomics / synchronization
- reduce the cost of atomics by building local (per-thread or per-block) histograms, then combining them with a parallel reduction
- Alternatively, Sort & Reduce by Key
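A sketch of the local-histograms idea (data and thread count are made up for illustration): each "thread" counts its own chunk without contention, and the per-chunk counts are then merged, which is itself just a reduction:

```python
import math
from collections import Counter

# Each "thread" builds a private histogram of its chunk (no atomics needed).
def local_histograms(data, nthreads):
    chunk = math.ceil(len(data) / nthreads)
    return [Counter(data[i:i + chunk]) for i in range(0, len(data), chunk)]

# Combine the locals; this loop could itself be a tree reduction.
def merge(hists):
    total = Counter()
    for h in hists:
        total += h
    return total

data = [0, 1, 1, 2, 0, 2, 2, 1, 0, 2]
print(sorted(merge(local_histograms(data, 4)).items()))
# [(0, 3), (1, 3), (2, 4)]
```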
- Additional references: